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Abstract 

Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual 
genomic landscapes and Identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing 
using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due 
to the moderate costs, manageable data amounts and straightforv/ard interpretation of analysis results. While 
whole-exome and, in the near future, v^hole-genome sequencing are becoming commodities, data analysis still 
poses significant challenges and led to the development of a plethora of tools supporting specific parts of the ana- 
lysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome 
sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identifica- 
tion, variant annotation and visualization. We report an overview of the functionality, features and specific require- 
ments of the individual tools. We then selected 32 programs for variant identification, variant annotation and 
visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two 
patients with a rare disease for testing identification of germllne mutations, two cancer data sets for testing variant 
callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set 
for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides 
a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers. 
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INTRODUCTION 

Recent advances in genome sequencing technologies 
are rapidly changing the research and routine work of 
human geneticists. Due to the brisk decline of costs 
per base pair, next-generation sequencing (NGS) is 
now affordable even for small- to mid-sized labora- 
tories. Whole-genome sequencing and whole-exome 
sequencing have proven to be valuable methods for 



the discovery of the genetic causes of rare and com- 
plex diseases [1]. Although cheaper than Sanger 
sequencing, whole-genome sequencing remains ex- 
pensive on a grand scale. Over and above, one 
sequencing run provides enormous amounts of data 
and poses considerable challenges for the analysis and 
interpretation. In contrast, whole-exome sequencing 
is becoming a popular approach to bridge the gap 
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between genome-wide comprehensiveness and 
cost-control, by capturing and sequencing the ~1% 
of the human genome that codes for protein se- 
quences [2, 3]. Furthermore, novel, non-optical tech- 
nologies [4] are developing fast, and soon devices will 
be available that can sequence up to 10 Gbases at a 
fraction of the current costs. Thus, it is expected that 
by the end of the year 2012 whole-exome sequencing 
can be performed at a reagent cost per sample in the 
range of 400-500€ [5]. 

Whole-exome sequencing has already been used 
for identifying the molecular defects of single gene 
disorders [6, 7], for elucidating some genetically het- 
erogeneous disorders [8, 9] and for improving the 
accuracy of diagnosis of patients [10, 11]. The 
amount of both raw and processed data for whole- 
exome sequencing is orders of magnitude smaller than 
for whole-genome sequencing. Nevertheless, each 
sequencing run identifies a wealth of simple nucleo- 
tide variations (SNVs), including single-nucleotide 
polymorphisms (SNPs) and small insertions and dele- 
tions (INDELs). On average, whole-exome sequen- 
cing identifies 12 000 variants in coding regions [12], 
of which ~90% are found in publicly available data- 
bases [13]. In comparison, ~5 million variants, includ- 
ing 144 000 new variants, are reported on average by 
whole-genome sequencing [14]. 

Not surprisingly, initiatives have been started using 
whole-exome sequencing to explore all Mendelian 
disorders, which wlU improve the functional anno- 
tation of the human genome and will provide new 
insights into mechanisms of disease development 
[15]. As of to date, OMIM [16], a catalog of 
Mendehan disorders, lists >3000 disorders where 
the molecular basis has been reported. Still, >3500 
disorders are listed where the genetic cause remains 
unknown and has yet to be identified [17]. 

Other main fields of interest for human geneticists 
are complex diseases and cancer. For example, the 
usage of NGS has led to the identification of driver 
mutations for specific types of cancer [18, 19], which 
are often reported to be structural or non-coding 
[20]. NGS paved the way for the fundamental 
understanding of mutated genes in cancer cells, af- 
fected pathways, and how these data inform our 
models and knowledge of cancer biology [21]. A 
recent study with regard to tumor vaccination 
showed a proof-of-concept in which somatic muta- 
tions are first detected using NGS. Subsequently the 
inimunogenicity of these mutations is defined, and 
finally, mutations are tested for their capability to 



elicit T-cell inimunogenicity [22]. Thus, tailored 
vaccine concepts based on the genome-wide discov- 
ery of cancer-specific mutations and individualized 
therapy seem technically feasible. 

The current bottleneck of whole-genome and 
whole-exome sequencing projects is not the sequen- 
cing of the DNA itself but lies in the structured way 
of data management and the sophisticated computa- 
tional analysis of the experimental data [23]. In order 
to get meaningful biological results, each step of the 
analysis workflow needs to be carefully considered, 
and specific tools need to be used for certain experi- 
mental setups. 

The complete NGS data analysis process is com- 
plex, includes multiple analysis steps, is dependent on 
a multitude of programs and databases and involves 
handling large amounts of heterogeneous data. It is 
not surprising that due to the enormous success of 
NGS projects, a flood of tools has been created to 
support specific parts of the analysis workflow. It is 
apparent that the appropriate choice of tools is a 
non-trivial task, especially for inexperienced users. 
Therefore, a number of review articles were recently 
published (e.g. [24—28]) to facilitate the choice of the 
most suitable tool for a particular application. 
However, these articles review only selected compo- 
nents of the NGS data analysis pipeline such as map- 
ping and assembly [24], sequence alignments [26], 
algorithms for SNP and genotype calling [25] or de- 
tection of structural variations (SVs) and copy number 
variations (CNVs) [27]. To the best of our know- 
ledge, a comprehensive review covering all individual 
analysis steps has not been reported yet. Such a review 
would be tremendously helpful for researchers plan- 
ning a NGS project since it would provide a rich re- 
source to guide the assembly of an analytical pipeline 
for a particular application or altematively select a 
fuUy integrated pipeline. In addition, by covering 
multiple steps of the workflow it addresses issues 
such as data handling and tool compatibility, which 
are neglected when only individual components are 
reviewed. We therefore initiated this study to survey 
existing tools spanning the complete analysis work- 
flow and compile a comprehensive list of programs for 
variant analysis of NGS data. Additionally, we wanted 
to test a particular scenario typical for a human gen- 
eticist, i.e. applying tools in the context of mutation 
discovery on a Mendelian disorder or on a cancer data 
set. This type of hands-on evaluation of NGS soft- 
ware using real data sets rather than benchmarking the 
performance of a bioinformatics pipeline provides 
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additional criteria for making a decision to select the 
appropriate tools. 

We surveyed 205 tools for whole-genome/ 
whole-exome sequencing data analysis supporting 
five distinct analytical steps (Figure 1): quality assess- 
ment, alignment, variant identification, variant anno- 
tation and visualization. We report an overview of 
the functionality, features and specific requirements 
of the individual tools. We then selected 32 
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Figure I: Basic workflow for whole-exome and 
whole-genome sequencing projects. After library prep- 
aration, samples are sequenced on a certain platform. 
The next steps are quality assessment and read align- 
ment against a reference genome, followed by variant 
identification. Detected mutations are then annotated 
to infer the biological relevance and results can be dis- 
played using dedicated tools. The found mutations can 
further be prioritized and filtered, followed by valid- 
ation of the generated results in the lab. 



programs for variant identification, variant annota- 
tion and visualization, which were subjected to 
hands-on evaluation using four data sets: one set of 
whole-exome data from two patients with a rare 
disease for testing identification of germline muta- 
tions, two cancer data sets for testing variant callers 
for somatic mutations, CNVs and SVs, and one 
semi-synthetic data set for testing identification 
of CNVs. Commercial tools such as Avadis 
NGS (www.avadis-ngs.com), CLC Genomics 
Workbench (www.clcbio.com), DNAnexus (dna- 
nexus.com). Ingenuity Pathways Analysis (www. 
ingenuity.com), NextGENe (www.softgenetics. 
com/NextGENe.html), Partek Genomics Suite 
(www.partek.com) or SNP and Variation Suite 
(www.goldenhelix.com) were not part of this 
sui"vey. 

The article is structured as follows. We first intro- 
duce the main application fields for NGS in human 
genetics, namely, Mendelian diseases, i.e. single gene 
disorders as a result of a single mutated gene like 
cystic fibrosis or sickle cell anemia, complex diseases, 
i.e. polygenic diseases associated with the effects of 
multiple genes in combination with other factors like 
diabetes or hypertension and cancer. Next, we de- 
scribe NGS platforms and the NGS data analysis 
workflow and review available tools for the distinct 
analysis steps. Then we explain the scope and the 
method used to evaluate the tools and report 
hands-on experience of the selected programs. 
Finally, we make general recommendations for tool 
selection and prioritization of candidate genes. 

APPLICATION OF 
NEXT-GENERATION GENOME 
SEQUENCING IN HUMAN 
GENETICS 

We considered three common scenarios for human 
geneticists using NGS data: (i) identification of causa- 
tive genes in Mendelian disorders (germline muta- 
tions), (ii) identification of candidate genes in 
complex diseases for further functional studies and 
(iii) identification of constitutional mutations as 
well as driver and passenger genes in cancer (somatic 
mutations). 

Mendelian disorders 

Traditionally, the main approach to elucidate causes 
of Mendehan disorders has been positional cloning 
based on linkage analysis [29] . The resulting genomic 
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interval should normally contain <300 candidate 
genes which are further investigated by Sanger 
sequencing [13]. This study design is only applicable 
for familial diseases where an appropriate sample size 
is available [30] and is not suitable for the identifica- 
tion of de novo dominant mutations. 

Currently, OMIM hsts ~3500 Mendelian dis- 
orders with unknown genetic causes [17]. Whole- 
exome sequencing is a powerful tool that has not 
only revolutionized comprehensive candidate gene 
sequencing in traditional positional cloning studies 
but also allowed identification of autosomal recessive 
disease genes in single patients from non- 
consanguineous families (e.g. [6, 7, 31]) as well as 
de now dominant mutations. As whole-exome 
sequencing identifies a vast amount of variants, 
sophisticated filtering approaches are needed to 
reduce the number of genes for further investigation. 
Furthermore, as current capturing methods cannot 
evenly capture exonic regions [32], potentially inter- 
esting mutations in these regions are possibly 
neglected. 

Whole-genome sequencing provides a complete 
view of the human genome, including point muta- 
tions in distant enhancers and other regulatory elem- 
ents which have been previously associated with 
hereditary diseases [33]. As the cost per sequenced 
base wiU likely drop in the future, whole-genome 
sequencing wiU presumably replace whole-exome 
sequencing. 

Complex diseases 

The genetics of complex phenotypes have been 
investigated for decades through association studies 
with candidate genes that, based on pathophysio- 
logical considerations, were suspected to be involved 
in the development of the phenotype [34]. This ap- 
proach was severely hampered by relying on some- 
times unfounded functional hypotheses as well as by 
applying wrong statistical assumptions. An altemative 
to this candidate gene approach are genome -wide 
association studies (GWASs), which have become 
more feasible through the advancement of high- 
throughput genotyping technologies. GWASs are 
based on the principle of linkage disequilibrium — 
the non-random association between alleles at difiier- 
ent loci — at the population level [35] . The develop- 
ment of SNP arrays, which can genotype many 
markers in a single assay in conjunction with bio- 
banks of either population cohorts or case— control 
samples, facilitated the ability to conduct GWAS. 



This unbiased survey of many genes and variants ro- 
bustly identified associations between 1300 loci and 
200 diseases or traits [36]. 

Genetic studies of complex phenotypes are based 
on either 'common disease— common variant' or 
'common disease— rare variant' hypotheses. GWAS 
primarily test the 'common disease— common variant' 
hypothesis, where complex phenotypes are the result 
of cumulative effects of a large number of conmion 
variants. In contrast, the 'conmion disease— rare vari- 
ant' hypothesis posits that multiple rare variants with 
large effect sizes are the main determinants of herit- 
abHity of the disease [34]. The field is now shifting 
toward the study of lower frequencies of rare variants 
[37], which can only be empowered by NGS and 
sophisticated bioinfomiatics approaches [38]. 

Defining the genetic basis of complex diseases 
using NGS can be perfomied by the following: (i) 
whole-genome, (ii) whole-exome and (iii) targeted 
subgenomic sequencing. Whole-genome and 
whole-exome sequencing have been successfully uti- 
lized to identify the genes responsible for complex 
hereditary diseases [39, 40], where whole-genome 
sequencing enables testing of both mentioned 
hypotheses. 

Finally, NGS can also be used to identify trait loci 
by re-sequencing candidate genes in a large number 
of patients and controls as demonstrated for Type 1 
diabetes [41]. This targeted subgenomic sequencing 
is likely to be supplanted by whole-exome (or 
whole-genome) sequencing in the near future. The 
challenge is now to utilize sequencing to enable the 
discovery of novel genes that contribute to the stu- 
died diseases. Given the vast number of genetic and 
non-genetic etiological factors of complex diseases, 
the ultimate approach will require exploiting biolo- 
gical and clinical data, and integration of additional 
data sets including RNA sequencing data, prote- 
omics data and metabolomics data. 

Somatic mutations 

For human geneticists, there is an important distinc- 
tion between constitutional and somatic mutations. 
Constitutional mutations, which have been inherited 
from the parents, are present in all cells of the body 
and may increase the susceptibility of an individual to 
be diagnosed with cancer [42]. New methods have 
led to the identification of novel genetic components 
which may improve assessment of cancer predispos- 
ition [43]. Indeed multiple cancer susceptibility loci 
have been identified, which indicate that there may 
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be a significant number of common alleles that con- 
tribute to the heritability of a specific cancer. 
However, each of these loci confers only a small 
contribution to the risk for cancer [44]. For example, 
22 common breast cancer susceptibility loci have 
been reported which explain only ~8% of the dis- 
ease's heritabUity [45]. For this reason, it is daunting 
to transfer these common alleles, which on their own 
have only a minor impact on disease risk, into a 
clinical test [43]. In contrast, according to the 
'common disease— rare variant' hypothesis [46], con- 
stitutional high-penetrance mutations in specific 
genes may cause a high risk of developing particular 
types of cancer. Whole-exome sequencing has led to 
the identification of high-penetrance mutations in 
other genes that are only relevant for a small propor- 
tion of families [47]. 

There is considerable interest in the exploitation 
of somatic mutations as a tool used to improve the 
detection of disease and, ultimately, to allow indivi- 
dualized treatment leading to better outcomes. The 
aim is to give patients a drug tailored to the genetic 
makeup of their tumor. The most famous example is 
the Bcr-Abl tyrosine kinase inhibitor imatinib, which 
represents a major therapeutic advance over conven- 
tional therapy in patients with Philadelphia chromo- 
some positive chronic myelogenous leukemia. Here, 
>90% of patients obtained complete hematologic re- 
sponse and 70—80% of patients achieved a complete 
cytogenetic response [48]. In colorectal cancer, the 
paradigms are 'KRAS' mutations in exon 2 (codons 
12 and 13), which have been established as a predict- 
ive marker for treatment with epidermal growth 
factor receptor inhibitors [49]. Therefore, it has 
become increasingly urgent for the clinical oncolo- 
gist to have access to accurate and sensitive methods 
for the detection of such predictive biomarkers. 



NGS VARIANT ANALYSIS 

WORKFLOW 

NGS platforms 

NGS instruments provide higher throughput at an 
unprecedented speed by sequencing millions of short 
DNA fragments in parallel [50, 51]. Currently, the 
three most commonly used platforms are Roche 454 
(introduced in 2005), lUumina (launched in 2006) 
and ABI SOLID (followed m 2008). AH three plat- 
forms sequence DNA by measuring and analyzing 
signals, which are emitted during the creation of 



the second DNA strand, but differ in how the 
second strand is generated. 

In order to produce detectable signals, template 
DNA is fragmented into small pieces, ampHfied and 
immobilized on a glass slide before sequencing. 
Roche 454 implements pyrosequencing, which 
measures released pyrophosphates allowing the ana- 
lysis of read fragments up to a few hundred base 
pairs. Since this technique infers the number of 
incorporated nucleotides from the signal's intensity, 
the system experiences problems when homopoly- 
mer stretches longer than 8 bp are sequenced [52]. 
This complicates identification of small insertions and 
deletions. lUumina appHes a sequencing-by-synthesis 
approach where only 1 nt per sequencing cycle is 
incorporated using reversible dye terminators. 
Thereby, it avoids homopolymer calling problems 
at the cost of being capable of sequencing only 
shorter fragments. ABI SOLID analyzes DNA by 
ligating fiuorescently labeled di-base probes to the 
first strand, requiring reading each base twice. Due 
to the nature of this approach, identified calls are not 
stored in nucleotide but in color space — a property 
that needs to be considered in downstream analyses. 

Depending on library preparation and sequencing 
technology, it is possible to sequence reads that are of 
a known chromosomal distance [26] . These so-called 
paired-end or mate-pair reads provide additional in- 
formation which can be used for enhancing mapping 
accuracy and identifying structural rearrangements 
[53]. 

After completing laboratory work and the actual 
sequencing, the researcher is confronted with a huge 
amount of raw data. The analysis of the data can be 
decomposed into five distinct steps (Figure 1): (i) 
quality assessment of the raw data, (ii) read alignment 
to a reference genome, (iii) variant identification, (iv) 
annotation of the variants and (v) data visualization. 
In the following paragraphs, we briefly explain each 
of the steps and review available software tools. The 
initial list of analysis tools was acquired by perform- 
ing multiple PubMed searches. Furthermore, we 
conducted additional Internet searches to identify 
tools not indexed by PubMed. An overview of the 
surveyed tools is given in Supplementary Tables (see 
also http://icbi.at/ngs_survey). 

Quality assessment 

The first analysis step after completing the sequen- 
cing run is to evaluate the quality of raw reads and to 
remove, trim or correct reads that do not meet the 
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defined standards. Raw data generated by sequen- 
cing platforms are compromised by sequence artifacts 
such as base calling errors, INDELs, poor quality 
reads and adaptor contamination [54]. These errors 
are quite common in sequencing data, and platforms 
are susceptible to a wide range of chemistry and in- 
strument failures [55, 56]. As many downstream ana- 
lysis tools are not capable of checking for low-quality 
reads, it is necessary to perform filtering and trim- 
ming tasks in advance to avoid drawing of wrong 
biological conclusions. Generally, these steps include 
visualization of base quality scores and nucleotide 
distributions, trimming of reads and read filtering 
based on base quality score and sequence properties 
such as primer contaminations, N content and GC 
bias. 

Several tools have been developed to perform the 
various stages of quality assessment (Supplementary 
Table SI shows 11 selected tools). The stand-alone 
tools NGSQC Toolkit [54] and PRINSEQ [57] are 
able to handle FASTQ and 454 (SFF) files, produce 
summary reports and are capable of filtering and 
trimming reads. FastQC (http://www.bioinfor 
matics.bbsrc.ac.uk/projects/fastqc) is compatible 
with all main sequencing platforms and outputs sum- 
mary graphs and tables to quickly assess the data 
quaHty. The tool ContEst [58] can be used to esti- 
mate the amount of cross-sample contamination in 
NGS data. Galaxy oifers an integrated tool [59] that 
creates FASTQ summary statistics and performs flex- 
ible trimming and filtering tasks. htSeqTools [60] and 
SolexaQA [55] include quality assessment, processing 
and visualization functionality. In addition, software 
tools have been published that support only the 
lUumina platfomi (i.e. FASTX-Toolkit (http://han 
nonlab.cshl.edu/fastx_toolkit), PIQA [61] and 
TileQC [62]) or provide certain specialized function- 
ality (i.e. TagCleaner [63]). 

Alignment 

After reads have been processed to meet a certain 
quality standard, they are usually aligned to an exist- 
ing reference genome [25]. Currently, there are two 
main sources for the human reference genome 
assembly: the University of Santa Cruz (UCSC), 
which is also hosting the central repository for 
ENCODE data [64] and the Genome Reference 
Consortium (GRC), which focuses on creating ref- 
erence assemblies [65]. Both resources provide sev- 
eral versions of the human genome. UCSC oflers 
versions hgl8 and hgl9 while GRC provides 



GRCh36 and GRCh37. Together these are the 
most widely used reference genomes. UCSC (hg) 
and GRC (GRCh) human assemblies are identical 
[66] but differ with regards to their nomenclature 
(e.g. UCSC uses a 'chr' prefix). 

Over the last years, many aUgnment programs 
have been developed [67] to efficiently process mil- 
lions of short reads and include, among others, 
Bowtie/Bowtie2 [68, 69], BWA [70, 71], MAQ 
[72], mrFAST [73], Novoalign (http://novocraft. 
com), SOAP [74], SSAHA2 [75], Stampy [76] and 
YOABS [77] (Supplementary Table S2). Since re- 
views about characteristics and properties of align- 
ment methods have been published elsewhere [26, 
67, 78], we do not report a comprehensive list but 
rather a shorter list of 17 commonly used tools and 
refer to the literature for an in-depth review of the 
methods and the corresponding tools. 

The sequencing technologies are constantly push- 
ing the lengths of generated reads — requiring new 
and improved algorithms. First-generation short-read 
aligners were often optimized for ungapped align- 
ment whereas today's programs can deal with 
longer read lengths and gaps. The majority of cur- 
rently available long-read alignment algorithms may 
be classified as either using hash table indexing, like 
in BEAT [79] or in SSAHA2 or using some sort of 
compressed tree indexing based on the Burrows- 
Wheeler transform [80]. Most alignment algorithms 
foUow the seed-and-extend paradigm, where one or 
more of so-called seeds are searched followed by an 
extension to cover the whole query sequence [26]. 

Additionally to the selection of the alignment pro- 
gram, three issues are noteworthy. First, to overcome 
the problem of ambiguity when mapping short reads 
to a reference genome, paired-end reads have proven 
to be a valuable solution and are highly recom- 
mended, if not even a requirement for whole-exome 
sequencing and whole-genome sequencing [81]. 
Second, reads that can only be mapped with many 
mismatches should not be considered and as a con- 
sequence, mutations that are only backed by such 
reads should be discarded from further analysis. 
And third, as current NGS technologies incorporate 
PGR steps in their library preparations, multiple 
reads originating from only one template might be 
sequenced, thereby interfering with variant calling 
statistics. For that reason, it is common practice to 
remove PGR duplicates after alignment in 
whole-genome and whole-exome sequencing 
studies. 
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Variant identification 

A crucial part of next-generation genome sequen- 
cing data analysis is the identification of variants. 
The choice of applied strategies for genotype-calling, 
somatic mutation identification and SV exploration 
is ultimately related to the data usage. It is of utmost 
importance to first carefully design the study as it 
ultimately affects analysis and testing strategies [82]. 
In addition, sequence coverage is an important factor 
in variant identification as called mutations should be 
supported by several reads [83]. Finally, appropriate 
tools for analyzing NGS data [25] have to be thor- 
oughly evaluated. 

Tools for genome-wide variant identification can 
be grouped into four categories: (i) germline callers, 
(ii) somatic callers, (iii) CNV identification and (iv) 
SV identification. The detection of germline muta- 
tions is a central part for finding causes of rare dis- 
eases. Cancer studies focus on the identification of 
somatic mutations by comparing sequencing results 
of tumor/normal pairs from one subject. The tools 
for the identification of large structural modifications 
can be divided into those which find CNVs and 
those which find other SVs such as inversions, trans- 
locations or large INDELs. CNVs are currently the 
only SVs that can be detected in both whole- 
genome and whole-exome sequencing studies. 
Therefore, dedicated tools have been designed that 
consider the properties of exome capturing [84]. 

We surveyed 63 tools for variant identification and 
provide a Hst for each of the four categories including 
input/output formats, supported platform, types of 
detectable variants and usage notes (Supplementary 
Tables S3-S6). 

Variant annotation 

With the large amount of data produced by NGS 
experiments, the possibility of predicting the func- 
tional impact of variants in an automated fashion is 
becoming increasingly important. Computer-aided 
annotation enables research groups to filter and pri- 
oritize potential disease-causing mutations for further 
analysis. The available tools implement different 
methods for variant annotation. Most of them 
focus on the annotation of SNPs, since they can be 
easily identified and analyzed. INDELs are also cov- 
ered by some tools, whereas annotation of structural 
variants is Umited to CNVs and only performed by 
recently developed applications. 

The most common form of annotation is to pro- 
vide database links to various public variant databases 



such as dbSNP. In terms of functional prediction of 
the variants, the tools employ different approaches, 
ranging from simple sequence-based analysis over 
region-based analysis to the evaluation of the struc- 
tural impact on proteins. The result of the functional 
analysis is a classification into accepted and deleteri- 
ous mutations. The programs often have more 
fine-grained risk classes or scores reflecting the Hke- 
lihood of a deleterious effect. 

Many annotation tools are provided as web appli- 
cations, which have the advantage that there is no 
need for installing or maintaining a local copy. 
Usually, applications using web interfaces are easy 
to use and self-explanatory. However, users are de- 
pendent on the availabihty and the performance of 
the provided services. Another issue is that many of 
the online tools do not support batch submission of 
variants, thus making them only viable for manual 
analysis of a small set of variants. Moreover, legal 
issues might arise since most services do not guaran- 
tee data confidentiality. On the other hand, offline 
tools usually provide more flexibility and are not 
dependent on the availability of any specific web- 
service, but require the user to have a certain 
degree of IT skills. The inspected annotation tools 
(74) with their respective input/ output formats, sup- 
ported variants, type [graphical user interface (GUI), 
client or web tools] and usage notes are given in 
Supplementary Table S7. 

Visualization 

An important and challenging step in every NGS 
data analysis workflow is the validation and visual- 
ization of the generated results. Visual representation 
of data can be tremendously useful for the interpret- 
ation of obtained results. Therefore, NGS visualiza- 
tion tools should support users by displaying aligned 
reads, mapping quality and identified mutations 
combined with annotations from various public re- 
sources. In addition, the tools should be user- 
friendly, intuitive and responsive. 

Visualization tools for genomic data can be 
divided into three different types [85]: (i) finishing 
tools supporting the interpretation of sequence data 
of de novo or re-sequencing experiments, (ii) genome 
browsers that allow users to browse mapped experi- 
mental data in combination with different types of 
annotation and (iii) comparative viewers that facili- 
tate the comparison of sequences from multiple or- 
ganisms or individuals. In addition to genome 
browsers, software suites have been published 
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which enable visuaHzation of identified CNVs and 
SVs [86-88], showing a global picture of genomic 
rearrangements and SVs. A list of 38 surveyed 
genome browsers and two CNV/SV visualization 
tools can be seen in Supplementary Tables S8 
and S9. 

Genome browsers can be divided into two major 
types: web-based appHcations running on a dedicated 
web server [89] and stand-alone tools, where most of 
them use a GUI. As all genome browsers with a GUI 
are implemented in Java, they can be used on plat- 
forms such as Windows, Mac and Linux systems. 

The main advantage of web-based genome brow- 
sers is the support of a variety of annotations. The 
user can browse reference genomes as well as differ- 
ent types of genomic annotations derived from a 
variety of public databases. Furthermore, users do 
not need to install new applications with numerous 
dependencies, and computational intensive calcula- 
tions are performed on the server. A drawback of the 
web-based genome browsers is the necessity of up- 
loading the data to a remote server, which poses 
security and legal issues. 

Stand-alone genome browsers offer interactive 
browsing and zooming features, which are missing 
in some web-based genome browsers. Furthermore, 
they do not require uploading the data to websites. 
Shortcomings include the need to download anno- 
tation files and the user's responsibility to keep 
annotations up-to-date. In addition, complex calcu- 
lations have to be performed by regular desktop PCs, 
which may not be powerful enough to deal with this 
workload. 

When interpreting aligned sequences using a 
genome browser, it is recommended to consider sev- 
eral important aspects [90]. Reads, which could be 
mapped with many mismatches should not be 
trusted and mutations, which are only backed by a 
small fi^action of reads should be discarded. 
Moreover, reads should only be trusted for further 
processing if they align at a unique starting position. 

Analytical pipelines and 
workflow systems 

As can be seen from this review, the scientific com- 
munity has access to a plethora of tools for NGS 
analysis. Combining these methods for analysis to 
obtain biologically meaningful results is still a chal- 
lenging task even for experienced users. A viable 
alternative is the use of complete analytical pipelines 



capable of analyzing all steps starting from raw se- 
quences to a set of identified and annotated variants. 

The analytical pipelines in general have a prede- 
fined order of analysis steps and built-in algorithms 
that cannot be easily modified and/or replaced. In 
contrast to pipelines, workflow management systems 
offer the flexibility to arrange a specific order of ana- 
lytical steps and execute a series of data manipulation 
or analysis steps. Most existing systems provide GUI 
allowing the user to build and modify complex 
workflows with little or no programming expertise. 

We identified 13 pubhshed pipelines and 12 avail- 
able workflow systems suitable for NGS analysis, 
which process data from various platforms using dif- 
ferent tools (see Supplementary Tables SIO and Sll, 
respectively). 

SCOPE AND METHOD OF 
EVALUATION OF NGS TOOLS 

The wealth of available tools reflects the importance 
of NGS and the tremendous dynamic of the field of 
data analysis. Of the five distinct analytical steps, 
quality assessment and read alignment can be con- 
sidered as matured and robust. Variant identification, 
variant annotation and visualization methods are 
essential for the detection of relevant variants and 
are still being developed. We therefore assessed se- 
lected tools for the latter three categories by installing 
and testing them. 

Criteria for tool selection 

We considered only published, freely available and 
constantly maintained tools for in-depth evaluation. 
To be classified as maintained, the tool had to be 
updated within the last year (cutoff date September 
2011) since the analysis of NGS data is a 
fast-evolving field and tools need to be constantly 
adapted. A further requirement for tool selection 
was the support of accepted standard input and 
output formats. Additionally, due to the heterogen- 
eity of tools available for different parts of the analysis 
workflow, we specified distinct selection criteria for 
each category as described below. 

Variant identification 

Gemiline and somatic variant callers were selected if 
they were able to: (i) use Sequence Alignment/Map 
(SAM) [91], Binary Alignment/Map (BAM) or 
pileup format as input and (ii) provide output results 
in the variant call format (VCF). Tools that 
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lable I: Variant identification 



Name OS BAM/SAM Other Inputs Output Identifies Data set Result" 

Input 



Germline callers 



CRISP 


Lin 




Yes 


- 


VCF 


SNP, INDEL 


KTS 


24 034SNPS, 259 INDELs 


GATK (UnifiedGenotyper) 


Lin 




Yes 




VCF 


SNR INDEL 


KTS 


49476 SNPs, 1959 INDELs 


SAMtools 


Lin 




Yes 


FASTA 


VCF 


SNR INDEL 


KTS 


21 852 SNPs, 332 INDELs 


SNVer 


Lin, 


Mac, Win 


Yes 


_ 


VCF 


SNR INDEL 


KTS 


22105 SNPs, 234 INDELs 


VarScan 2 


Lin, 


Mac, Win 


No 


pileup/mpileup 


VCF, 

VarScan CSV 


SNR INDEL 


KTS 


34984 SNPs, 1896 INDELs 


Dmatic callers 


















GATK 


Lin 




Yes 




VCF 


INDEL 


WES 


151 INDELs 


(Somaticlndel Detector) 


















SAMtools 


Lin 




Yes 


FASTA 


BCF 


SNR INDEL 


WES 


Canceled*" 


SomaticSniper 


Lin 




Yes 




VCF, somatic 
sniper output 


SNR INDEL 


WES 


6926 SNPs 


VarScan 2 


Lin, 


Mac, Win 


No 


pileup/mpileup 


VCF, 

VarScan CSV 


SNR 

INDEL, CNV 


WES 


1685 SNPs, 324 INDELs 


NV identification tools 


















CNVnator 


Lin 




Yes 


FASTA 


CSV 


CNV 


cnviim 


39 CNVs 


RDXplorer 


Lin, 


Mac 


Yes 


FASTA 


CSV 


CNV 


cnviim 


4 CNVs' 


CONTRA 


Lin, 


Mac 


Yes 


FASTA 


VCF, CSV 


CNV 


WES 


3 CNVs 


ExomeCNV 


Lin, 


Mac, Win 


Yes 


pileup + 
BED + FASTA 


CSV 


CNV LOH 


WES 


137 CNVs 


V identification tools 


















BreakDancer 


Lin, 


Mac 


Yes 


config file 


CSV BED 


INDEL, INV 

TRANS, 

CNV 


WGS 
(tumor 
+ normal) 


6219 DELs, 0 INSs, 
7 INVs, 17 303 ITX, 
5037 CTX 


Breakpointer 


Lin 




Yes 




GFF 


INDEL 


WGS 
(tumor) 


d 


CLEVER 


Lin 




Yes 


FASTA 


CLEVER 
format 


INDEL 


WGS 
(tumor) 


d 


GASVPro (GASVPro-HQ) 


Lin, 


Mac 


Yes 




clusters file 


INDEL, INV 
TRANS 


WGS 
(tumor) 


2529 DELs, 207 INVs 


SVMerge 


Lin 




Yes 


FASTA 


BED 


INDEL, 




Aborted'' 



INV CNV 



Four different types of tools for variant identification can be distinguished: germline callers, somatic callers, CNV identification and SV identification 
tools. Listed are the results of the tested applications (4, 2, 3 and 5, respectively). All surveyed applications are listed in Supplementary Tables 
S3-S6. 'SNVs are counted based on their position but in a sequence independent manner ""Somatic mutation calling with SAMtools was canceled 
due to unclear definition of tumor and normal files. Furthermore, we were not able to find the CLR field in the resulting vcf file, which should hold 
the Phred-log ratio between the likelihood by treating the two samples independently, and the likelihood by requiring the genotype to be identical. 
'For RDXplorer the filtered result data set was used. CLEVER and Breakpointer created result files with >2.6 million lines, which need to be fur- 
ther processed. 'Installation was aborted due to unreasonable dependencies. OS, operating system; Lin, Linux; Mac, Mac OS X; Win, Windows; 
BAM, Binary SAM; BED, Browser Extensible Data, a text-based file format; CSV, comma separated values; FASTA, text-based format for represent- 
ing nucleotide sequences; GFF, general feature format; mpileup, multisample pileup; pileup, text-based format representing base-pair information 
at each chromosomal position; SAM, Sequence Alignment/Map; VCF, Variant Call Format; CNV, copy number variation; CTX, inter-chromosomal 
translocation; DEL, deletion; INDEL, insertion/deletion; INS, insertion; INV, inversions; ITX, intra-chromosomal translocation; LOH, loss of hetero- 
zygosity; SNP, single-nucleotide polymorphism; SNV, simple nucleotide variant; SV, structural variant; TRANS, translocations. 



determine CNVs and SVs were selected when they 
accepted SAM/BAM as input format. Table 1 lists aU 
tools selected for testing. 

Variant annotation 

For further consideration, annotation programs 

were required to: (i) accept VCF as input format 

(including tools that offer converters) and (ii) 



integrate results from other software (Table 2). 
Furthermore, results had to be reported in an 
output file, and web-based tools needed to provide 
a batch submission system. 

Visualization 

Selection criteria were the availability of a GUI and 
the support of VCF, SAM and BAM, as these 



Survey of tools for variarit arialysis of r)ext-ger)eratior) genome sequer)c'mg data 



265 



3 
Z 



CO 

Q 



D 
> 

z 

u 



lu 
Q 
Z 



Z 

(/) 



3 

o 



3 

a. 
c 



o 



z 



o o 



Z ^= 



o 
U 



+ 



X 



X 



b a 

Q. O ^ ^ a. 

o- -i £ 5 a. 

u: E C2 < ^5 u: 

(j o Li- o *< U 

> u o bi u > 



(H O 

„ - u 



O Q. I 



o 

I U 



> 



tr, = 



(/I (O VI 



to 



llT 



X * 

LLT 1- "!> 

> 5 u 



X 



o 

Q- 



1^ 



< 

cLX 



> 



5 ^ 

- r- 

Q- Z ^ ^ 

Ll_' rt LL H Ll_' Q. LL O LiT ^~ E 

(jcQu<U«!UUUUX a) 



O a. 



1^ 

i 



ij -J 5 -J -J i -J 



< 

< LU 
> > 



aj n O 



a ^ a 



»-> _J L/^ 



0) O LU 



"2 



S g B 
-2 g S 



t-> O (U 



c c 0) > 

•- O 1/1 0) 

Si " ^ 

-* c c 3 

, . ^ c 0) 



u < 



rl ^ 1^ 
M ■- 



i 5 £ 

! O o 



O O 



< 

.O 



E 
o 
i- 

t-> 

a 

I > 



c ^ l/l 



> T3 5 

Ji « So 
"? <" < 



52- 

Q in 



CL O .E 
o E 

LJ ^ 4J 

<M ^ E 



" s ^ 'S 

4J (J — I ) i-t 

S i ^ ^ S 

rt Q_ — ij' Wl 
^ "° B I 1 

I S E ^ V 

= § M L 

5! > 'J = 

> S — r M o 

> s M a 

0) ti ^" .£ L 

1 -s ^ 2 ^ 

^ LU 5 MU 1 

-Mil' 

S ° X s £ 

^ " o = 

« £ I .- 3j 



- o Si 

L_ t£ 4-1 

1 s .= 

0) rt O 

2 E _^ 

c jq re 

re o 

!_ c - 

re o y 



re o 



-5 .2 -5 



Li_ rt rt 



^ (5 §S 

•- <u t; 

- o i-l 

. n U J! 

>■ t9 

E S != 

S Z „ 

v> to X 

^ <^ LU 



Q re a: 
l£ E -2 



3 y 

Lll y 



0) 



£ ^ 

^ i 

CL o ^ 

° E Q 

LO LU 
1/1^4-1 

-g Q. 



X > 

g Co 

^ <!' 

to 



Ll_ >. rt Wl 
( n 1-1 p rt 



i_ rt O "O Q) 

J! 3 S o -o 

^ _y jrt _a> -o 

"Ore' lire 

° ^ " ^ 
g O S .3i .6■ 
re re E .E 3 

9 "5_ "q. Q-* Wl 

^: rt -t ^ 



^ re ^ 



c/l O 
re L/l 



.2 E 



o 
U 

> ^ 

U g 



i3 O 
re M 



o 



-S 2 i O ■> -5 



266 



Pabinger et al. 



formats are considered a de facto standard in the field. 
We did not consider finishing tools, as they are pri- 
marily designed for de novo sequencing projects as 
well as comparative viewers and genome browsers 
that only supported viewing of aligned reads. 

Analysis pipelines and workflow systems 

The quality assessment step is NGS platform specific 
and it is not necessarily a part of an analytical pipeline 
or workflow system. Therefore, pipelines and work- 
flow systems were selected for close inspection if 
they cover the analysis steps of read alignment, vari- 
ant detection and variant annotation. 

Test data sets and hardware 

In order to test the software suites, we selected four 
data sets. The first data set (KTS) contains two sam- 
ples and is a paired-end whole-exome sequencing 
run used for the elucidation of the Kohlschiitter— 
Tonz syndrome [92]. Reads were aligned to the 
UCSC reference human genome assembly (hgl9) 
using the sequence alignment software BWA version 
0.5.10 [71] using the default parameters. 

For the second data set, we randomly selected one 
of the simulated artificial tumor data sets published 
by [93] (csv_sim). This data set (14.7 million reads) is 
based on chromosome 22 (UCSC reference human 
genome assembly hgl8), where 42 CNVs with set 
sizes of 100, 500bp, 1, 5, 10, 50 or lOOkb were 
simulated. The copy number of the CNV segments 
had been chosen to represent either a homo- or het- 
erozygous deletion or up to a four-copy gene gain. 

The third data set is an example of a whole-exome 
sequencing study containing paired-end data from 
germline and tumor samples derived from a 
42 -year-old Hispanic female with mast-cell leukemia 
(WES) (SRP008740 [94]). Tumor-genomic DNA 
originated fi-om bone marrow biopsies, whereas the 
germline sample was derived from saliva of the same 
patient. Both samples were aligned to UCSC refer- 
ence human genome assembly (hgl9) using BWA 
version 0.5.10. 

For testing SVs, a whole-genome sequencing data 
from a liver metastatic lung cancer samples obtained 
from lung adenocarcinoma patients (ERP001071, 
[95]) was selected (WGS). We randomly chose one 
tumor/normal pair (subject AK55) sequenced on an 
lUumina HiSeq 2000 of the available data sets. To 
enable comparison with the study's results, read 
alignment was performed to UCSC reference 
human genome assembly (hgl9) using the alignment 



program GSNAP [96] version 2012-07-20 and de- 
fault parameters. 

All tools have been tested with default parameters 
on the same server system to ensure comparable re- 
sults between the test runs. This system consists of an 
HP Proliant DL580 G7 server, equipped with four 
Intel E7-4870 CPUs and 512 GB of main memory. 
This results in 40 CPU cores at a tact rate of 2.4 GHz 
each and 12.8 GB of memory per core. Data storage 
was provided by a hybrid system of both 900 GB of 
directly attached high-performance hard drives and 
30 TB of network attached storage via a 6 GB/s fiber 
channel link. On the software side, CentOS 6.2 was 
used with all packages on their latest versions. AH 
Python tools have been tested with a manually com- 
piled version of Python 2.7.3. R scripts were exe- 
cuted with R version 2.15.1 [97]. 

EVALUATION RESULTS 

In this section, we present the evaluation results for 
the three essential NGS data analysis steps (Figure 1): 
(i) variant identification, (ii) variant annotation and 
(iii) visualization. Additionally, we closely describe 
existing pipelines integrating read alignments, variant 
identification and annotation. We were able to install 
and run all selected tools (in total 32 programs). It is 
noteworthy that the installation and maintenance of 
the majority of the tools and/or pipelines require 
certain degree of IT expertise, which is usually pro- 
vided by the local IT department. Below we briefly 
describe the tools and provide additional information 
of the software in Supplementary Information II. 

Variant identification 

The evaluated tools for variant identification were 
divided into four groups (Table 1): gemiline callers 
(five tools), somatic callers (four tools), CNV identi- 
fication tools (four tools) and SV identification tools 
(five tools). 

Germline callers 

'CRISP' [98] is a tool for identifying SNPs and 
INDELs from pooled NGS data. It is intended to de- 
tect both rare and common variants. Furthermore, it 
has been specifically designed to detect variants from 
pooled data and should not be used for analysis of 
single samples. 

'GATK' [99] is a software library that provides a 
suite of tools for working with human data, includ- 
ing depth of coverage analyzers, a quaUty score 
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recalibrator, a local realigner and a SNP/INDEL 
caller. The authors offer extensive documentation 
and an informative wiki system. In addition to calling 
germline variants, GATK can be used to identify 
somatic mutations. 

'SAMtools' [91] is a versatile collection of tools for 
manipulating SAM and BAM files. It contains a 
subset of commands called BCFtools. BCFtools has 
also the ability to call SNPs and short INDELs from a 
single alignment file. Furthermore, SAMtools can be 
used to call somatic mutations from a pair of samples. 

'SNVer' [100] is an operating system independent 
statistical tool for the identification of SNPs as well as 
INDELs, in both pooled and individual NGS data. 
Recently, SNVerGUI [101] has been published, 
which provides a GUI version of SNVer tool. 

'VarScan 2' [102] is a platfomi-independent pro- 
gram that can be used to identify gemiline variants, 
as well as shared and private variants. In addition to 
germline variants, 'VarScan 2' is able to identify som- 
atic mutations and somatic CNVs. Germline variants 
are called using a heuristic method as well as a stat- 
istical test based on the number of aligned reads sup- 
porting each allele. 

CRISP, GATK, SAMtools, SNVer and VarScan 2 
were tested with the KTS data set. The tools identi- 
fied -24000, 49 000, 22 000, 22 000 and 34 000 
SNPs as well as 259, 1959, 234, 332 and 1896 
INDELs, respectively. Figure 2A depicts the overlap 
of identified variants in a Venn diagram. More than 
13 000 of identified SNPs are overlapping between 
all used tools, in contrast to INDELs where none is 
shared with all applications. CRISP reports the 
lowest number of non-overlapping SNPs (8), 
whereas >23% of SNPs are only identified by 
GATK. VarScan 2 and GATK share the largest 
number of overlapping INDELs (~57%) in contrast 
to SNVer, where 99% of reported INDELs are not 
identified by any other tool. In summary, tools differ 
widely regarding called INDELs and show larger 
agreement in identified SNPs. It is however note- 
worthy that all INDELs called by CRISP were also 
identified by GATK (96% by VarScan2). 

Somatic callers 

In addition to GATK, SAMtools and VarScan 2, we 
evaluated another somatic caller: 'SomaticSniper'. 
'SoniaticSniper' [103] is a command-line application 
to identify SNPs that are different between tumor/ 
normal pairs. It calculates a somatic score that 



predicts the probability that tumor and normal geno- 
types are different. 

GATK, SAMtools, SomaticSniper and VarScan 2 
were tested using the whole-exome tumor data set. 
As SAMtools does not clearly specify how to define 
the input for tumor and normal data sets and the 
generated result file lacked infomiation about the 
somatic score, SAMtools was not further considered 
for calling somatic mutations. GATK is only capable 
of calling INDELs and reported 151 mutations. 
SomaticSniper could identify 6926 SNPs and 
VarScan 2 was able to detect 1685 SNPs and 324 
INDELs. Compared with germline results, the 
agreement of the identified variants is small (depicted 
in Figure 2B). 

Copy number variations identification 

We tested four tools for CNV identification: (i) 
'CNVnator' [104], a CNV discovery and genotyping 
tool for whole-genome sequencing data which uses 
read-depth analysis based on mean shift; (ii) 
CONTBJV [105], a tool for detecting CNVs in 
whole-exome data. The application calls copy 
number gains and losses for each specified target 
region; (iii) 'ExomeCNV [84], a R package for 
the identification of CNVs and loss of heterozygosity 
from whole-exome sequencing data. The tool works 
best when paired samples (i.e. tumor/normal pairs) 
are available and (iv) 'RDXplorer' [106], a tool for 
detecting CNVs in human whole-genome sequen- 
cing data. It uses read depth coverage and detects 
CNV based on the event-wise testing algorithm. It 
should be noted that due to numerous dependencies, 
the installation of this tool is challenging for 
non-experienced users. 

CNVnator and RDXplorer were tested using the 
simulated CNV data set with known number of 
CNVs (42). CNVnator was able to call 39 CNVs; 
RDXplorer identified 4 CNVs (filtered results). This 
results in a precision and recall of 0.74 and 0.69, as 
well as 1 and 0.1 for CNVnator and RDXplorer, 
respectively. Figure 2C depicts the agreement be- 
tween known (cnv_sini) and predicted CNVs. 

As CONTRA and ExomeCNV were designed for 
the analysis of whole-exome sequencing data, the 
'WES' data set was used for testing. ExomeCNV 
identified 137 CNVs, whereas CONTRA was able 
to call 8940 CNVs using a P-value < 0.05. Three out 
of those were reported as significant (adjusted 
P- value < 0.05), whereby the usage of a less-conser- 
vative threshold (adjusted P-value<0.1) resulted in 
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A (Germline callers) 



SAMtools 



GATK 



SNVer 




B (Somatic callers) 

VarScan 2 



GATK 



CRISP 



C (CNV identification tools) 

cnv Sim 



VarScan2 




SomaticSniper 



RDXplorer 



CNVnator 




D (Exonne CNV identification tools) 

ExomeCNV CONTRA 




Figure 2: Venn diagrams showing the number of identified variants for tested germline (A), somatic (B), CNV (C) 
and exome CNV (D) tools. The depicted numbers in (A) and (B) report identified SNPs and INDELs. Venn diagram 
(C) shows the overlap between known (cnv^im) and predicted CNVs. Figure (D) illustrates the overlap between 
CONTRA and ExomeCNV The intersection numbers were adjusted to reflect that 10 CNVs detected by 
CONTRA are located within 3 CNVs reported by ExomeCNV. 



11 CNVs. Out of the remaining 11, 10 CNVs were 
located within 3 CNVs reported by ExomeCNV. 
The direction (gain/loss) reported by both tools was 
the same for all of the overlapping variants. 

Structural variants identification 

'BreakDancer' [107] predicts five types of structural 
variants: insertions, deletions, inversions, inter- and 
intra-chromosomal translocations. 

'Breakpointer' [108] is a conmiand-line tool espe- 
cially designed for the location of potential intra- 
chromosomal sequence breakpoints from single end 
reads. It uses a heuristic method to identify local 
mapping signatures created by INDELs longer than 
the read's length or other SVs. 



'CLEVER' [109] identifies SVs in genomes from 
paired-end sequencing reads. It uses an insert size- 
based approach, which takes all reads into account. 
The tool offers an intuitive script with default par- 
ameters to facilitate usability. 

'GASVPro' [110] represents the probabilistic ver- 
sion of the original GASV algorithm [111] and de- 
tects SVs from paired-end data. 

'SVMerge' [112] is a software suite that integrates 
results from several different SV callers and performs 
subsequent validation and refinement of identified 
breakpoints. Unfortunately, the software suite is 
not provided as a ready-to-use virtual box or 
cloud implementation, which would enhance the 
usability. 
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BreakDancer was tested with the WGS tumor/ 
normal data set and identified 6219 deletions, 7 in- 
versions, 17 303 intra- and 5037 inter-chromosomal 
translocations. AH other inspected SV callers were 
applied on the WGS tumor data set only. 
Breakpointer determined 2.7 million possible intra- 
chromosomal breakpoints in total. CLEVER pro- 
duced 2.6 million entries. 'GASVPro-HQ' identified 
3941 raw deletions and 407 raw inversions, which 
were automatically reduced to 2529 and 207, 
respectively. Although GASVPro commands oifer 
to call inter-chromosomal variations, the feature is 
not yet supported by its pipeline scripts. Testing of 
SVMerge was aborted, as installation dependencies 
are hkely beyond the IT skill levels of many bench 
scientists. 

In summary, all selected tools for variant identifi- 
cation were successfully installed and tested with the 
appropriate data sets. As can be seen in Figure 2 and 
Table 1, there were considerable differences in the 
obtained results. The agreement between the callers 
was larger for SNPs compared with INDELS and 
larger for germline than for somatic mutations, re- 
spectively. Furthermore, the results show even 
bigger discrepancy for the identified CNVs and 
SVs (Figure 2 and Table 1). 

There are several possible explanations for this ob- 
servation. First, there are a number of parameters and 
thresholds that can be adjusted for the specific tools 
and hence, influence the outcome. Early methods 
for genotype calling are based on fixed cut-offs 
whereas recent methods such as GATK or 
SAMtools provide measures of statistical uncertainty 
when calling genotypes and hence, might improve 
the accuracy (see also excellent review on genotype 
and SNP calling by Nielsen et al. [25]). Second, the 
underlying statistical models are divergent and harbor 
different assumptions, which are often difficult or 
impossible to evaluate experimentally. For example, 
the ploidy, the degree of tumor heterogeneity and 
the percentage of normal cells in the sample are in 
most cancer studies difficult to estimate. Thus, using 
a statistical model based on a diploid genome would 
in this case inevitably lead to erroneous results. 
Third, in order to assess the performance of newly 
developed tools, the authors often rely on simulated 
data to benchmark their method. However, due to 
the complexity of the underlying error model and 
the varying properties of NGS data, simulated data 
may be neglecting certain error cases and make as- 
sumptions that are not valid. Furthermore, the use of 



simulated data for the development of a new tool 
might tailor the method to specific data sets and 
hence, severely hamper the perfomiance when 
using real NGS data. Further studies are required 
in order to benchmark the performance of the vari- 
ant callers either by using an exome or genome-scale 
size simulation data set or a large number of experi- 
mentally evaluated variants by Sanger re-sequencing 
as shown recently [113]. We also strongly believe 
that further theoretical studies are necessary to de- 
velop new and/or improved variant callers. As the 
number of sequenced genomes/exomes is steadUy 
increasing, specific statistical models wlU become 
available for certain diseases or cohorts. 

Variant annotation 

Below we describe the evaluation results of the eight 
selected tools (see also Table 2). 

'ANNOVAR' [114] is a conmiand-line tool for 
up to date functional annotation of various genomes, 
supporting SNPs, INDELs, block substitutions as 
well as CNVs. The tool provides a wide variety of 
different annotation techniques, organized in the 
categories gene-based, region-based and fdter-based 
annotation. The tool depends on several databases, 
which need to be downloaded individually. This ap- 
proach ensures that the correct database version is 
used and the download of large unnecessary data 
sets is avoided. 

'AnnTools' [1 15] is a command-line tool for ana- 
lyzing SNPs, INDELs and CNVs found in both 
coding and non-coding regions. The program relies 
on 15 different widely used data sources such as 
dbSNP, which are regularly updated. A database 
update tool is provided to help keep the local data- 
base up-to-date. 

'NGS— SNP' [116] is a collection of Perl scripts for 
the annotation of SNVs using the Ensembl database 
as a reference. The program uses the online version 
of the Ensembl database, which has the advantage 
that the reference database is always up-to-date. In 
comparison to other tools, NGS— SNP took several 
days to complete the annotation process, which is 
likely due to the latency of querying the online data- 
base during the tool's execution. 

The 'SeattleSeq' Annotation server (http://snp.gs. 
washington.edu/SeattleSeqAnnotation) provides a 
web application for annotating human SNPs and 
INDELs. In contrast to most other web-based anno- 
tation services, SeattleSeq provides the possibility to 
directly upload input files in various formats for batch 
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analysis of multiple variants. The analysis of both 
hgl8/GRCh36 and hgl9/GRCh37 data is sup- 
ported by providing a separate input page for each 
genome version. As variant annotation is performed 
on a remote server, the tool might be interesting for 
research groups without dedicated hardware for data 
analysis. 

'Sequence variant analyzer (SVA)' [117] is a 
stand-alone tool with a GUI dedicated to annotating 
and visualizing variants identified by NGS experi- 
ments. The tool includes its own genome browser 
and supports annotation of SNPs, INDELs and 
CNVs. Setting up the program and writing the pro- 
ject configuration file might be difficult without IT 
expertise. As the hardware requirements are rather 
demanding, the use of a powerful workstation is rec- 
ommended (>48 GB of RAM; >1 TB free hard disk; 
quad core processor). The program could not cor- 
rectly annotate VCF files, which were using a UCSC 
hg reference version. 

'snpEff [118] is a popular variant annotation, 
which has also been integrated within Galaxy and 
GATK. In addition to SNPs, the tool also supports 
INDELs and multiple-nucleotide polymorphisms. 
snpEff identifies various different effects, which are 
categorized into four classes (high, moderate, low 
and modifier) by their functional impact. 

'VARIANT' [119] can detect the functional prop- 
erties of SNVs in coding as well as non-coding re- 
gions. The program can be used via a web interface, 
as a command-line tool or as a web service. Since the 
command-line tool also makes use of the remote 
VARIANT database, the tasks of maintaining and 
searching of databases are provided by the authors. 
Therefore, the command-line version of the tool is 
usable without profound IT expertise and can be 
executed on regular office PCs. The web interface 
features anonymous usage and allows creation of per- 
sonal accounts, which enables users to view their 
uploaded input files and analysis results once they 
log in. 

'Variant effect predictor (VEP)' [120] is Ensembl's 
own functional annotation tool, formerly known as 
SNP effect predictor. The tool can be used either by 
a web interface, as command-line tool or via a Perl 
API. The web interface version is aimed at users 
analyzing smaller sets of variants, as it is only capable 
of processing 750 variants per file. 

In summary, all annotation tools provide a set of 
general (e.g. mutation introduces a new stop codon) 
or detailed (e.g. mutation hits an exon) attributes for 



each identified mutation. These properties can be 
used to assess the potential impact of each mutation. 
All tested applications provide links to one or 
more public databases of known mutations. Four 
of the tested tools provide prediction scores 
reflecting their estimated deleterious impact. 
ANNOVAR uses six different scores: GERP++ 
[121], LRT [122], MutationTaster [123], PolyPhen 
[124], PhyloP conservation [125] and SIFT [126]. 
SeattleSeq supplies four scores: GERP [127], 
Grantham [128], phastCons [129] and PolyPhen. 
NGS— SNP and VEP provide three scores: Condel 
[130], PolyPhen and SIFT. These scores are com- 
puted based on various different approaches, such as 
sequence homology, evolutionary conservation, pro- 
tein structure or statistical prediction based on 
known mutations. Due to the possibility of precal- 
culating several scores (GERP, PolyPhen and SIFT) 
for every position in the genome, tools such as 
ANNOVAR or SeattleSeq make use of databases 
to look up the requested annotation. Hence, the 
annotation process itself is very fast as no on the fly 
calculation is needed and similar results are reported. 
It should be noted that results are often difficult to 
interpret for inexperienced users. Furthermore, 
among the tested tools, only stand-alone tools 
(ANNOVAR, AnnTools and SVA) were able to 
annotate identified CNVs. 

Visualization 

All of the evaluated genome browsers have in 
common that they are capable of displaying numer- 
ous ID tracks which contain information about the 
reference genome, the transcriptome, aligned reads, 
found mutations, annotations collected from public 
data sources or other data types important for the 
correct interpretation of results [131]. The two 
types of genome browsers, namely web-based appli- 
cations and stand-alone tools, as well as CNV/SV 
visuaHzation tools are given in Table 3. 

Web-based applications 

'The Ensembl Genome Browser' [132] provides a 
variety of reference genomes in combination with 
numerous local annotation sets and external re- 
sources. This genome browser is able to display con- 
tinuative information when searching for a specific 
entity. Users can add their own tracks either by up- 
loading the data file (restricted to 5 MB) or by spe- 
cifying a remote URL or the distributed annotation 
system (DAS). It is not necessary to create an 
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lable 3: Visualization 



Name OS BAM/ VCF Other formats Annotation 

SAM 



Web-based genome browsers 



EnsembI Genome Browser 


web interface 


Yes 


Yes 


BED, bedGraph, GFF, GTF, PSL, WIG, BAM, bigWig 


Yes 


UCSC Genome Browser 


web interface 


Yes 


Yes 


BED, bigBed, bedGraph, GFF, GTF, WIG, bigWig, MAF, SNP, PSL 


Yes 


VEGA Genome Browser 


web interface 


Yes 


Yes 


BED, bedGraph, bigBed, bigWig, GBrowse, GFF, GTF, PSL, WIG 


Yes 


Stand-alone genome browsers 












Artemis 


Lin, Mac, Win 


Yes 


Yes 


BCF, FASTA 


Yes 


Integrative Genomics 


Lin, Mac, Win 


Yes 


Yes 


SNR GFF, BED, IGV TAB, WIG, (>30 formats) 


Yes 


Viewer (IGV) 












Savant 


Lin, Mac, Win 


Yes 


Yes 


FASTA, BED, GFF, WIG, TAB 


Yes 


CNV and SV visualization 












Circos 


Lin, Mac, Win, 


No 


No 


GFF, CSV 


Yes 



web interface 



This table holds genome browsers as well as tools producing circos plots, whereby genome browsers were split into web-based applications, access- 
ible using a web browser and stand-alone tools with a graphical user interface. All genome browsers use tracks to display different features like ref- 
erence genome, annotations or experimental data. Further visualization tools can be found in Supplementary Tables S8 and S9. OS, operating 
system; Lin, Linux; Mac, Mac OS X; Win, Windows; BAM, Binary SAM; BCF, Binary VCF; BED, Browser Extensible Data, a text-based file format; 
bedGraph, file format allowing the display of continuous-valued data in track format; bigBed, compressed, binary-indexed BED file; bigWig, com- 
pressed, binary indexed WIG file; FASTA, text-based format for representing nucleotide sequences; GBrowse, Gbrowse proprietary format; GFF, 
General Feature Format; GTF, Gene Transfer Format; IGV, Integrative Genomics Viewer format; MAF, Multiple Alignment Format; PSL, pattern 
space layout; SAM, Sequence Alignment/Map; SNP, Personal Genome SNP format; TAB, tab-delimited file; VCF, Variant Call Format; WIG, Wiggle 
Track Format. 



Ensembl account to attach or upload data, but it can 
be used to save specific information for later use. 

'The University of Santa Cruz (UCSC) Genome 
Browser' [133] offers several different annotations, 
including phenotype and disease annotations. User- 
specific data can either be uploaded or remotely 
hosted and provided by an URL. The UCSC data- 
base comprises almost 1800 annotation tracks for the 
human genome GRCh37/hgl9. 

'The Vertebrate Genome Annotation (Vega) 
Genome Browser' [134] is built upon code of the 
Ensembl Genome Browser. Data available in Vega is 
based on a fi-eezed Ensemble release that has under- 
gone manual annotation and curation by the Havana 
group at the Welcome Trust Sanger Institute. The 
genome browser offers standardized gene classifica- 
tion encompassing pseudogenes, non-coding tran- 
scripts and PolyA sites. Vega contains annotations 
from different species for comparative analysis, but 
currently human chromosomes 18 and 19 are stiU 
outstanding. 

In general, web-based genome browsers provide 
different reference genomes already integrated with a 
variety of annotation tracks. A drawback is the 
required data handling, as the user has to provide 
URLs for external files, which requires uploading 
large amount of data to either a local web server or 
a remote location. In addition, files need to be 
packed, sorted and indexed before they can be used. 



Stand-alone applications 

'Artemis' [135] incorporates BamView [136] for 
viewing aligned reads and is able to display different 
properties of a loaded sequence. It features two se- 
quence windows, which can be used to display the 
same sequence at different zoom levels. Artemis 
allows filtering of variants based on different proper- 
ties in real time and supports the export of calculated 
properties such as SNP density and read counts. 

'The Integrative Genomics Viewer (IGV)' [137] 
supports loading of tracks, reference genome as well 
as annotation data from local or remote data sources. 
Furthermore, the tool uses the DAS system and is 
capable of handling >30 different fde formats. IGV is 
also capable of displaying sample metadata (e.g. gen- 
der and age) as a heatmap, which can be shown ad- 
jacent to the sample's corresponding track. The IGV 
tool package provides functionality for tiling, count- 
ing, sorting and indexing of several file formats. In 
addition to the stand-alone GUI version, IGV can be 
launched and controlled using scripts enabling the 
generation of different image snapshots at once. 

'Sequence Annotation and Visualization and 
ANalysis Tool (Savant)' [138] uses a modular dock- 
ing framework, where each module appears as a sep- 
arate window. Local as well as remote data sets can 
be loaded into Savant, and it supports simple file 
fomiatting functionality (e.g. VCF zipping and 
indexing) . 
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All tested stand-alone genome browsers are imple- 
mented in Java, and therefore run on all operating 
systems with an installed JVM. The applications are 
able to load different reference genomes and com- 
bine them with tracks holding experimental data or 
annotations. The stand-alone genome browsers 
showed excellent perfomiance for zooming and pan- 
ning and were able to visualize SNPs and INDELs. 
Moreover, they generally feature a rich user interface 
and sometimes support the inclusion of custom ana- 
lytical modules using a plug- in architecture (Savant). 
Web-based genome browsers facilitate displaying of 
data in context with already existing annotations and 
data tracks without the need for downloading or 
installing additional information. Moreover, due to 
their universal accessibility, they support collabor- 
ation within or between different institutions. 
Thus, given the above characteristics, the selection 
of a genome browser for prospective users and for 
specific criteria is straightforward. 

CNVand SV visualization 

'Circos' is a widely used command-line tool written 
in Perl for the visualization of similarities and differ- 
ences of genomes. Although it is capable of visualizing 
arbitrary data, it is clearly intended for multi-sample 
genomic data. Extensive documentation and tutorials 
are provided which are helpful to correctly set up all 
configuration files. In addition to the simple 
command-line interface, an online version of the 
tool is available allowing the upload of input data 
and configuration files using a simple web interface. 
Circos is very flexible and allows users to display data 
as scatter, line and histogram plots, as well as heat 
maps, tiles, connectors and text. Due to this flexibility, 
the learning curve for new users is steep and it might 
take some time to explore all available options. 

Analysis pipelines and workflow systems 

We evaluated three analytical pipelines ('HugeSeq', 
'SIMPLEX' and 'TREAT') and three workflow 
systems ('Galaxy', 'LONI' and 'Tavema'). 

'HugeSeq' [139] is a fuUy integrated pipeline for 
NGS analysis from aligning reads to the identification 
and annotation of variants (SNPs and INDELs for 
whole-genome and whole-exome sequencing data 
as well as CNVs and SVs for whole-genome data 
only). It consists of three main parts: (i) preparing 
and aligning reads, (ii) combining and sorting reads 
for parallel processing of variant calling and (iii) vari- 
ant calling and annotating. The pipeline accepts as an 



input reads in FASTA or FASTQ format and outputs 
identified variants in VCF format, and SVs and 
CNVs in OFF format. Identified variants are further 
processed with ANNOVAR to include additional 
annotations. 

'SIMPLEX' [140] is an autonomous analysis pipe- 
line for the analysis of NGS exome data, covering 
the workflow from sequence alignment to SNP/ 
INDEL identification and variant annotation. It sup- 
ports input from various sequencing platforms and 
exposes all available parameters for customized 
usage. It outputs summary reports and annotates de- 
tected variants with additional information for dis- 
crimination of silent mutations from variants that are 
potentially causing diseases. SIMPLEX is provided as 
a ready-to-use virtual machine and can be used in 
the Amazon EC2 Cloud. 

'TREAT' [141] is a pipehne where each of the 
three modules (alignment, variant caUing and variant 
annotation) can be used separately or as an integrated 
version for an end-to-end analysis. It provides a rich 
set of annotations, HTML summary report and vari- 
ant reports in Excel format. TPJiAT can be down- 
loaded for local use (requires large main memory) or 
launched on the Amazon EC2 Cloud. The pipeline 
currently provides only the human reference 
genome hgl8. 

'Galaxy' [142] is a web-based platform where the 
user can perform, reproduce and share complete ana- 
lyses. Pipelines are represented as a history of user 
actions, which can be stored as a dedicated work- 
flow. It contains scripts for over a 100 analysis tools 
and users can add new tools (requiring basic inform- 
atics skills) and share all analysis steps and pipelines. 
Workflows are stored directly in a dedicated data- 
base, and jobs can be distributed onto a high- 
performance computing infrastructure. Researchers 
can use the online version, install a local version on 
their server system or run Galaxy on the Amazon 
EC2 Cloud. 

'LONI' [143] is a workflow processing application 
that can be used to wrap any executable for use in 
the LONI system. In order to access the tools, users 
need to connect to either public or private pipeline 
servers. Workflows can be divided into several mod- 
ules and researchers can either use existing modules 
or create new ones from scratch. Several NGS ana- 
lysis workflows have been pubhshed that are ready to 
be imported and used. 

The 'Tavema' [144] workflow management 
system stores workflows in a format that is simple 
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to share and manipulate outside the editor. Initially, 
it was not shipped with any prepackaged NGS ana- 
lysis tools, and integrating tools required some pro- 
gramming experience. However, workflows for 
NGS have been made public that use a specific 
queuing system to distribute jobs to a computational 
cluster. These workflows can be downloaded and 
imported, and additionally edited. 

In summary, the analytical pipelines provide 
out-of-the box solutions for analyzing whole- 
genome or whole-exome sequencing data. How- 
ever, due to the fixed choice of included tools they 
do not provide as much flexibility as the workflow 
systems. In contrary, the tested workflow systems 
offer greater flexibility for integrating analytical 
tools but often require to manually install analysis 
tools and may not offer predefined solutions to ana- 
lyze sequencing data. 



DISCUSSION 

In our report, we provide a comprehensive survey of 
tools available for the analysis of whole-genome/ 
whole-exome data covering all analytical steps: qual- 
ity assessment, alignment, variant identification, vari- 
ant annotation and visualization. The information 
we provide represents a valuable guideline for both 
an expert in the field and a less-experienced user to 
select the appropriate tools for a specific application 
and assemble an optimal analytical pipeline. The ana- 
lysis of NGS data is a fast moving field and recom- 
mendations which tools to use might quickly 
change. Nevertheless, we make the following gen- 
eral recommendations. 

First, tools for quality assessment and alignment 
matured to a great extent and the choice is rather 
straightforward. Depending on the supporting plat- 
form, several tools are available that report properties 
and quality of obtained sequencing reads. Among the 
evaluated tools, FASTQC and NGSQC are capable 
of providing reports for FASTQ as well as the ABI 
SOLiD file format. Based on the obtained quality 
results, tools can be used to trim or remove reads 
that do not suffice the predefined quaHty standards. 
We recommend tools that are in active development 
and provide trimming as well as filtering functional- 
ity such as FASTX-Toolkit or PRINSEQ. Similarly, 
alignment software suites have been constantly im- 
proved over the past few years and several tools are 
widely used and supported by a large user commu- 
nity. For example, BWA, Bowtie and SOAP have 



matured greatly and are actively used in NGS data 
analysis projects. It is noteworthy that support for the 
ABI SOLiD platform has been dropped in recent 
versions of some of the alignment tools (e.g. 
BWA> 1.6.0, Bowtie2). 

Second, for variant identification, we suggest a 
consensus approach, e.g. running CRISP, GATK 
and SAMtools on the same data set. It has been re- 
ported that no single approach comprehensively cap- 
tures all genetic variations, and therefore, several 
variant identification tools should be appHed. 
Variants will then only be considered if they fulfill 
certain criteria (e.g. identified by a minimum 
number of independent variant callers) [145]. 
However, applying such stringent criteria inevitably 
filters out true positives. If the candidate variant is 
missed, the search can be broadened and additional 
biological criteria can be applied. It is well known 
that the used library preparation method and the 
selected sequencing technology directly influence 
the reported outcome. It is therefore good practice 
to consider certain metrics (e.g. strand bias) and use 
heuristic approaches to separate true positives from 
false positives. Finally, additional aspects worth con- 
sidering are the availability of up-to-date documen- 
tation and the existence of an active user 
community. 

Third, the choice of an annotation tool is largely 
dependent on the desired selection of variant anno- 
tations. Four of the evaluated tools (ANNOVAR, 
SeattleSeq, NGS— SNP and VEP) include prediction 
scores, which are used to reflect the estimated dele- 
terious impact of a particular variant. Several tools 
have been published that generate multiple vari- 
ant annotations at once, which can be tremendously 
useful for shortening the analysis time 
(e.g. ANNOVAR, SeatdeSeq or VARIANT). 
Furthermore, tools should be selected that regularly 
update the included information of publicly available 
databases. In addition, users should consider possible 
security issues when using web-based annotation 
tools. 

Fourth, visualization tools are usually easy to install 
and it might therefore be a valid approach to test 
different software suites. In addition, some visualiza- 
tion programs such as IGV, Savant or the UCSC 
Genome Browser offer the possibility to obtain an- 
notation tracks — such as reference genomes — which 
facilitates the usage for inexperienced users. Further 
consideration for the choice of the visuaHzation tool 
is the possibility to handle the data format used in 
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previous analysis methods for reporting variants and 
variant annotations. When choosing web-based sys- 
tems, users should be aware of security and legal 
issues that might arise. 

Prioritization of candidate variants 

With the use of whole-exome and whole-genome 
sequencing, the challenge of the 'next-generation 
genetics' is one of narrowing down the hst of can- 
didate variants and interpreting remaining variants 
within a biological context [146]. A widely used ap- 
proach to substantially reduce the candidate list is to 
exclude known variants which are present in public 
SNP databases, published studies or in-house data- 
bases as it is assumed that common variants represent 
harmless variations [147]. Another way to narrow 
down the genomic search space is the use of pedigree 
information, i.e. sequencing distantly related individ- 
uals with the phenotype of interest which might be 
suiiicient to identify the causing mutation [148]. This 
approach is also helpful to identify the cause of 
common disorders that are genetically highly hetero- 
geneous. However, since each generation introduces 
up to 4.5 deleterious mutations [149], it might be as 
well that a de novo mutation is causing the disease. 
Furthermore, without family information, it is often 
difficult to predict whether a disease follows a reces- 
sive or dominant inheritance [147]. 

In the case of cancer genomes, all potential vari- 
ations need to be considered, including germline sus- 
ceptibihty loci, somatic SNPs and INDELs, CNVs 
and SVs [150]. A common method is to use pairwise 
comparison of tumor and nomial tissues from the 
same individual to distinguish somatic coding muta- 
tions [10] and to identify driver mutations [18, 19]. 
In addition, tools have been published that can de- 
termine significant mutations in cancer by using 
groups of tumor/normal sample pairs, clinical data 
and identified variants from the cohort under inves- 
tigation [151]. 

All prioritization methods have in common that 
researchers risk removing the pathogenic variant 
[147] which is reflected by the currently reported 
high rates of false-positive and false-negative predic- 
tions [152, 153]. Therefore, prediction tools should 
be used with caution since they may not be reliable 
enough to infer a definitive diagnosis [154]. The use 
of different prioritization approaches [147] and the 
combination of prediction results with phenotypic 
and pedigree data as well as data from databases 



might be the best approach to determine the poten- 
tial cause of the disease under investigation [154]. 

Outlook 

As sequencing costs continue to fall, we will experi- 
ence a shift from sequencing human whole-exomes 
to whole-genomes, which will create the need for 
even more sophisticated methods to find the muta- 
tions causal for diseases [10]. We believe that future 
developments of workflow and pipeline systems will 
tremendously facilitate the analysis of NGS data, as 
they do not require complex installation routines and 
necessary data conversion steps from end-users. 

SUPPLEMENTARY DATA 

Supplementary data are available online at http:// 
bib.oxfordjoumals.org/. 
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